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Reversibility is a key concept in Markov models and Master-equation models of molecular kinetics. The 
analysis and interpretation of the transition matrix encoding the kinetic properties of the model relies heavily 
on the reversibility property. The estimation of a reversible transition matrix from simulation data is therefore 
crucial to the successful application of the previously developed theory. In this work we discuss methods for 
the maximum likelihood estimation of transition matrices from finite simulation data and present a new 
algorithm for the estimation if reversibility with respect to a given stationary vector is desired. We also 
develop new methods for the Bayesian posterior inference of reversible transition matrices with and without 
given stationary vector taking into account the need for a suitable prior distribution preserving the rneta- 
stable features of the observed process during posterior inference. All algorithms here are implemented in the 
PyEMMA software - http://pyemma.org - as of version 2.0. 


I. INTRODUCTION 

Markov models, Markov state models (MSMs), or 
Master-equation models are a powerful framework to 
reduce the great complexity of bio-molecular dynamics 
to a simple kinetic description that represents the un¬ 
derlying transitions between distinct conformations 1 7 . 
These models allow us to analyze the longest-living 
(metastable) sets of st ruct ured, the effective transi¬ 
tion rates between them®^, the kinetic relaxation pro¬ 
cesses and the ir relationship to equilibrium kinetics 
experiments 11 ^, and the thermodynamics and kinetics 
over multiple thermodynamic states-®*^. A key advan¬ 
tage of MSMs is that they are estimated from conditional 
transition statistics between states, and they thus do not 
require the data to be in global equilibrium across all 
states. As a result, they are an excellent tool to inte¬ 
grate the data of multiple simulation trajectories that 
have been run independently and from different initial 
states into a single informative modeP^HH] 

A variety of complex molecular processes have been 
successfully described using MSMs. Examples in¬ 
clude the folding of proteins into their native folded 
structur e! 20 * 22 * 23 ^ , the dynamics of natively unstructured 
protein^HSU, and the binding of a ligand to a target 

proteirpUm. 

There are two key steps in the construction of a MSM. 
At first a suitable discretization of the continuous con¬ 
formation space has to be obtained. In most cases no 
good a-priori discretization is known and the discretiza¬ 
tion has to be found based on the simulation data. The 
appropriat e choice of discretization is a topic of ongo¬ 
ing researcH- 24 * 31 * 32 ! The error incurred by the discretiza¬ 
tion and by the subsequent approximation of the jump- 
process as a Markov process can be systematically con¬ 
trolled and evaluated' 33 35 . 
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In the second step one estimates the transition proba¬ 
bilities between pairs of states based on the transition 
statistics. The most common approach to estimating 
Markov models from data is by means of a Bayesian 
framework. One first harvests the transition counts, 
Cij(r) from the data, i.e. how often trajectories were 
found in discrete state i at some time t and in discrete 
state j at some later time t + r. The parameter r is 
called lag time and is crucial for the quality of the Markov 
modeP®. Next, one computes the transition matrix ei¬ 
ther by maximizing the likelihood, i.e. the probability 
over all possible Markov model transition (or rate) ma¬ 
trices that may have generated the observed transition 

count^™ 

or by sampling Markov models from the 
posterior distributioiPSHlSl A maximum likelihood esti¬ 
mate gives a single-point estimate, i.e. a single Markov 
model that is “most representative” given the data. How¬ 
ever, if some transition events are rare compared to the 
total simulation length - and this is the typical case in 
molecular dynamics simulation - this maximum likeli¬ 
hood model might be very uncertain and thus far away 
from the model that the one would converge to by in¬ 
creasing amount of simulation data. The Bayesian poste¬ 
rior ensemble is a natural approach to quantify such sta¬ 
tistical uncertainties and thus to make meaningful com¬ 
parisons between a Markov models obtained from differ¬ 
ent sets of simulations, or to experimental data. 

A key property of molecular dynamics at thermal equi¬ 
librium, and a necessary consequence of the second law of 
thermodynamics, is microscopic reversibility of the equa¬ 
tions of motion. This property is ensured by many simu¬ 
lation procedure d 43 * 44 * and carries over to detailed balance 
between discrete states, i.e. formally leads to a (time-) re¬ 
versible Markov model. A reversible Markov model is not 
only physically desirable, it offers statistical advantages 
as it has only about half as many independent parame¬ 
ters compared to a nonreversible model, and it allows the 
equilibrium kinetics to be analyzed in a straightforward 
and meaningful manner. Furthermore imposing detailed 
balance with respect to a given stationary vector can be 
used to aid the efficient estimation of rare-event processes 





















2 


from MSMsP. 

Algorithms imposing detailed balance during likeli¬ 
hood maximization have been discussed in-' 16 37 * 46 [ First 
methods for the sampling the posterior distribution of re¬ 
versible transition matrices have been suggested irP^ and 
later iiP. A method working with natural priors for re¬ 
versible chains was proposed in 48 . The sampling of tran¬ 
sition matrices reversible with respect to a fixed station¬ 
ary distribution was also presented in^S, while a Gibbs 
sampling algorithm with a significantly improved con¬ 
vergence rate has been developed iii^. Ref.S^ discusses 
methods for goodness-of-fit tests for Markov chains. 

This article is deliberately broad and presents new 
concepts, insights and algorithms for reversible Markov 
model estimation in general, maximum likelihood esti¬ 
mators and Bayesian estimators that mutually benefit 
from each other. For this reason we first give a survey of 
principles and consequences of reversible Markov mod¬ 
els. We then extend the framework of maximum likeli¬ 
hood estimation of transition matrices by giving a simpli¬ 
fied maximum likelihood estimator (MLE) for reversible 
transition matrices and a new estimator for reversible 
transition matrices with a fixed equilibrium distribution. 
The main part of the paper comprises new algorithms 
for the full Bayesian analysis of the posterior ensemble 
of reversible Markov models. As yet, three fundamental 
problems have not been satisfactorily solved: (i) How can 
one harvest statistically uncorrelated transition counts 
from trajectories in which subsequent transitions are cor¬ 
related, so as to give rise to meaningful uncertainty in¬ 
tervals? (ii) Which prior should be used in a Bayesian 
analysis so as to get error intervals that envelop the true 
value even for Markov models with many states? (iii) 
How can we design efficient sampling algorithms for the 
reversible posterior ensemble, i.e. algorithms that allow 
to quickly compute reliable error bars for Markov mod¬ 
els with many states? In this paper we discuss (i) give 
a rather complete treatment of problems (ii) and (iii). 
Efficient sampling algorithms are derived for reversible 
Markov models and reversible Markov models with fixed 
equilibrium distribution. 


II. REVERSIBLE MARKOV MODELS 

In this section we will show that microscopic reversibil¬ 
ity carries over to the discretized situation and discuss the 
desirable properties of a reversible Markov state model. 


A. From microscopic reversibility to discrete-state 
detailed balance 


Let /r( x ) denote the equilibrium distribution on the 
microscopic degrees of freedom x £ O, e.g. all-atom co¬ 
ordinates of the system of interest, and let p T (x, y) denote 
the conditional transition density of the MD implementa¬ 
tion. p r (x, y) is the probability that the system is found 


in state y at time t + r given that it has been in state 
x at time t. The MD implementation fulfills microscopic 
reversibility if the following detailed balance equation 

y(x)p T (x, y) = p(y)p T (y, x) (1) 

holds for all pairs of states x,y £ fi. Hence the terms 
“detailed balance” and “reversible” are equivalent in our 
context. Since p{x)p T {x , y) is the unconditional proba¬ 
bility to find the transition (x, t) —> (y, t + t), Eq ([I]) 
means that the system is on average time-reversible - the 
absolute number of transitions from x to y is equal to 
the reverse. Microscopic detailed balance is desirable to 
have in any MD implementation when the aim is to per¬ 
form simulations in thermodynamic equilibrium. If the 
0 would be violated, that would imply the existence of 
cycles x — > y — » 2 — » x along which there is a net trans¬ 
port. Since such cycles could be used to generate work, 
their existence in a system that is driven by purely ther¬ 
mal energy would be inconsistent with the second law of 
thermodynamics. 

Dynamical models that are commonly employed in 
MD implementations fulfill detailed balance. Brownian 
(overdamped Langevin) dynamics fulfills detailed bal¬ 
ance. Hamiltonian and non-overdamped Langevin dy¬ 
namics fulfill generalized detailed balance with respect 
to momentum inversion in phase space, but when inte¬ 
grating over the distribution of momenta they do fulfill 
ordinary detailed balance in position spac^® 1 . In practice, 
some finite time-stepping integrators do not obey exact 
detailed balance with respect to the Boltzmann distribu¬ 
tion, but we here consider that the MD implementation 
has been chosen such that detailed balance is at least 
approximately fulfilled. 

Now suppose that the state space H is partitioned into 
non-overlapping subsets Si, ..., S n that we shall call dis¬ 
crete states here. Each set has an equilibrium probability 
given by 


tT i = / d x p,(x) (2) 

J Si 

and the transition density gives rise to a discrete state 
transition matrix P(r) with entries 


Pij( T ) 


f Si d xf Sj dy p(x)p T (x,y) 
f s , dxp(x) 


( 3 ) 


Using ([2| and Q and microscopic detailed balance, ([!]), 
it is straightforward to verify that 


KiPijix) = n j p ji (T). (4) 

Note that Q holds independently of the choice of the lag 
time t. Moreover, Q implies that 7r is the equilibrium 
probability vector of P(r). By defining the diagonal ma¬ 
trix n = diag(7Ti, ..., 7r„), we can alternatively write Q 
as a matrix equation: 

np = (np) T (5) 

p = n~ 1 p T n (6) 
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constraints 

dof 

ft} 

none 

n{n — 1) 

n 2 

reversible 

\n(n — 1) + n — 1 

W 

reversible, fixed 7r 

| n(n — 1) 

fX 


Table I. Number of independent variables (degrees of freedom, 
dof) and their approximated values for transition matrices 
depending on the constraints. 


As a result, if the microscopic dynamics are reversible, 
the Markov model transition matrix must also be re¬ 
versible. However, a direct estimate of P from a finite 
amount of simulation data cannot be expected to ful¬ 
fill Q exactly. Thus, the principle validity of detailed 
balance motivates us to enforce Q in the process of es¬ 
timating P. 

Enforcing detailed balance helps to reduce the sta¬ 
tistical error of an estimator for P because it reduces 
the number of independent variables roughly by approxi¬ 
mately one half (see Table ,§)• However, there are other 
consequences of having Q: Given detailed balance, we 
can compute the molecular equilibrium kinetics in a phys¬ 
ically meaningful way and employ some useful analysis 
tools that are not defined for nonreversible Markov mod¬ 
els. Moreover, we can employ more efficient and robust 
matrix algebra routines when exploiting that P is a re¬ 
versible matrix. 


B. Eigenvalues and eigenvectors 

Many methods to analyze the molecular kinetics based 
on a Markov model transition matrix rely on the eigen¬ 
value decomposition of P. Using the diagonal eigen¬ 
value matrix A = diag(Ai, ..., A„) we can formulate a 
right eigenvalue problem with right column eigenvectors 
R = (n, ..., r n ), ri £ K" and left row eigenvectors 
L = (h, ... l n ) T ■ 


PR = RA (7) 

LP = A L. (8) 

From 0 , we can obtain a generalized eigenvalue prob¬ 
lem: 


n pr = hra. 

n is symmetric positive definite and as a result of de¬ 
tailed balance, nP is symmetric. Hence, all eigenvalues 
Ai, ..., A ra are real, and the eigenvectors are orthogonal 
with respect to the equilibrium distribution 51 : 

in, rj)„ oc Sij , 

where we have used the weighted scalar product (it, v) n = 
yT TTjUjVi. We can make R orthonormal by scaling an 
arbitrarily obtained eigenvector r, by (r,, r,), r ' . 


Inserting the detailed balance formulation ([ 6 ]) into the 
decomposition Q immediately gives: 

p t hr = hra 
(n r) t p = A(n P) T 

which is a left eigenvalue problem with the choice 

l = (n r) t 

Ij = nr,. (9) 

Thus, detailed balance establishes a 1-to-l relation be¬ 
tween the left and the right eigenvectors. We can decom¬ 
pose the transition matrix into its spectral components 
by just using one set of eigenvectors and the equilibrium 
distribution, such as: 


p = rar t h = Y KnrJ n. (io) 

i =1 

a. Example 1: Consider the following reversible 3 x 
3 transition matrix, 


/ 0.5 0.34 0.16 \ 

P = 0.28 0.5 0.22 . (11) 

\ 0.15 0.25 0.6 / 

Suppose we generate a Markov chain of length 20 starting 
from state 1 , resulting in the count matrix at lag r = 1 : 

/ 4 3 0 

C = 14 3 

\1 1 2 



Now we conduct a nonreversible and a reversible maxi¬ 
mum likelihood estimation of the transition matrix given 
C. Eigenvalues for the exact transition matrix in (11) 
and both nonreversible and reversible estimates for the 
given count matrix in[l2]are shown in Fig. [l] 

It is seen that the nonreversible estimate contains com¬ 
plex eigenvalues. These generally come in complex con¬ 
jugate pairs. Fig. [l] shows a much higher accuracy of 
the reversible estimate compared to the nonreversible es¬ 
timate. In order to explore the statistical significance 
of this observation, we run N = 1000 chains of length 
L = 20 using transition matrix ©• The reversible and 
nonreversible estimation results, together with the true 
eigenvalues, are reported below: 



Ai 

Re{A 2 } 

Im{A 2 } 

Uc{A 3 } 

Im{ A 3 } 

exact 

1 

0.42 

0.0 

0.18 

0.0 

rev 

1 

0.36T0.31 

0.0 

0.18T0.04 

0.0 

nonrev 

1 

0.32T0.29 

-0.04i0.08 

0.07T0.21 

0.04A0.09 


It is seen that the reversible estimates do not only 
have the correct real-valued structure, but can also have 
smaller uncertainties (here especially for A 3 ). This is ex¬ 
pected to be a general result due to the smaller number 
of degrees of freedom in the reversible estimate. 
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Figure 1. Eigenvalues of 3 x 3 example system. The eigenval¬ 
ues obtained from the reversible estimate (green) are a closer 
approximation to the true eigenvalues (red) than the eigen¬ 
values obtained from the non-reversible estimate (blue). The 
unique eigenvalue A = 1 is faithfully reproduced by both es¬ 
timates. 



Real part, Re(A) 

Figure 2. Eigenvalues for alanine dipeptide. The cluster of 
dominant eigenvalues indicates that the slowest processes are 
faithfully reproduced by the non-reversible (blue) as well as 
by the reversible (green) estimate. Only the eigenvalues of 
the reversible estimate are purely real. 


b. Example 2: Fig. [2] shows the distribution of 

eigenvalues from nonreversible and reversible Markov 
models from simulation data for the alanine-dipeptide 
molecule (see Sec. V BI. The eigenvalues of the reversible 
estimate are purely real while the non-reversible estimate 
has eigenvalues with non-zero imaginary part. 


C. Equilibrium kinetics analyses 

Since detailed balance is a consequence of a system 
simulated at dynamical equilibrium, it is not surpris¬ 
ing that detailed balance in the transition matrix P is a 
prerequisite to analyze the equilibrium kinetics given P. 
Since kinetics are related to slow processes we will here 
only consider the m largest eigenvalues Ai, ... A m and as¬ 
sume that they positive. Here are a few examples for 
equilibrium kinetics properties computed from reversible 
transition matrices: 

1. The dominant relaxation rates of the molecular sys¬ 


tem are: 


Hi = 


- - In Ai 

r 


(13) 


where * > 2 (* = 1 has a relaxation rate of zero 
and corresponds to the equilibrium distribution). 
The inverse quantities are the relaxation timescales 
ti = k~ . These rates or timescales are of special 
interest because they are often detectable in kinetic 
experiments such as fluorescence time-correlation 
spectroscopy, two-dimensional IR spectroscopy or 
temperature jump experiments - see^for a discus¬ 
sion. 


2. The decomposition (10) can be used to write ki- 
neti c experi mental observables in an illuminating 
formA 1 l 3 f | 4 l. For example, the long-timescale part 
of the autocorrelation of a molecular observable 
a € R", e.g. containing the fluorescence values of 
every Markov state of a molecule, can be written 
as: 


acf(a; r) ss (a, tt) 2 +^](a, n)le KiT (14) 

i—2 

3. The PCCA+ method for seeking m met astable sets 
of Markov states and its variant^lSH assumes m 
real-valued eigenvalues and eigenvectors. It is thus 
only reliably applicable to reversible transition ma¬ 
trices. 

4. Discrete transition path theorjJsSHIH computes the 
statistics of transition pathways from a set of states 
A to a set of states B given a transition matrix. 
Discrete TPT can be used with nonreversible and 
reversible transition matrices. However, in the re¬ 
versible case we get that the forward committor and 
the backward committor are complementary: 



and the net fluxes, when ordering states such that 
qf < are given by: 

fij = ( 9 / ~Qi)^iPij 

which is analogous to an electric current I = UG 
where I = fjj, U = ( q + — qf) is the potential 
difference and niPij is the conductivity®. 

III. LIKELIHOOD, COUNTING, AND MAXIMUM 
LIKELIHOOD ESTIMATION 

We restate the transition matrix likelihood and for¬ 
mulate the maximum likelihood estimation problem for 
Markov model transition matrices. We present new es¬ 
timation algorithms for reversible Markov models with 
known or unknown equilibrium distribution n. 
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A. Likelihood 

Suppose we have a discrete sequence S = {si, sjy} 
with Si £ {1, n}. If we assume that this sequence is 

the realization of a Markov chain with lag time r = 1, 
the probability that a transition matrix P has generated 
X is proportional to the product of individual transition 
probabilities along the trajectory: 


N 


ns i p )« n 


Ps 


tSt+1 


(15) 


We have neglected the proportionality constant because 
we won’t need them in order to maximize or sample from 
(151. This is very handy because one component in this 


constant is the probability of generating the first state, 
Ps 0 , which is often unknown, but is constant for a fixed 
data set S. 

Suppose we have c (J transitions from i to j. Then 
we can group all Pi 7 terms together and get a factor p^. 
Doing this for all pairs results in the equivalent likelihood 
formulation: 


pm«niK (i6) 

i j 

We can see that the count matrix C £ R nxn is a sufficient 
statistics for the Markov model likelihood P(S | P) - it 
generates the same likelihood although we have discarded 
the information in which sequence the transitions have 
occurred. 

If multiple trajectories are available, their count ma¬ 
trices are simply added up. 


Both cases can in principle be treated with the fol¬ 
lowing formalism: We always obtain the count ma¬ 
trix Cij in a sliding window modc^, i.e. we harvest 
all N — t available transition counts from time pairs 
(1 —> t), (2 —» r + 1), ..., (N — t —> N). Unless S is 
Markovian at lag time 1, we will now harvest more tran¬ 
sition counts than are statistically independent. We can 
formally correct for this by introducing a statistical ineffi¬ 
ciency Iij (r) for every count at a given lag time, such that 
c if( r ) = c ij( T ) is th e effective number of counts, 

resulting in the likelihood 

___ eff 

P(C|P) cc n P% • (17) 

*> 3 

The determination of statistical inefficiencies for univari¬ 
ate signals is well established' 57 ! Determining hj(r) for 
transition count matrices is an open problem. A first 
approach that allows for the first time to estimate con¬ 
sistent, although somewhat too small uncertainty inter¬ 
vals for practical MD data is discussed iiP^. Note that 
the validity of the estimation algorithms described in the 
present paper are independent of the choice of the count 
matrix, such that future methods for estimating the ef¬ 
fective count matrix can be adopted without changing 
the estimation ablgorithms. 


C. Maximum likelihood estimation 


We will now assume that the effective counts are given. 
For better readability we will subsequently omit the su¬ 
perscript eff and just use C = (c^-) to indicate counts. 
Now we ask the question what is the most likely transi¬ 
tion matrix for the observation C, i.e. we seek the maxi¬ 
mum likelihood estimate (MLE) that maximizes (15) over 
the set of transition matrices. 


B. Counting 


How should we count transitions for longer lag times 
r > 1, or if S is not Markovian at lag time r? Regarding 
the first case, if S is Markovian at lag time r, a safe ap¬ 
proach seems to subsample the trajectory at time steps of 
r and then treat the subsampled trajectory as above 7 ®. 
However this approach is statistically inefficient: If S is 
also Markovian for shorter lag times than r, then we are 
using less information than we could. Even if S only 
becomes Markovian at lag times of r or longer, transi¬ 
tions such as 1 —► r + 1 and r/2 —► r + r/2 are usually 
only partially correlated, such that discarding the second 
transition is also not fully exploiting the data. In prac¬ 
tical MD simulations, the lag times required such that a 
Markov model is a good approximation need to be quite 
long (often in the the range of nanoseconds), such that 
subsampling the data at r will create severe problems 
with data and connectivity loss. Regarding the second 
case, if S is not Markovian at lag time r then treating 
every as an independent count is incorrect. 


1. Non-reversible estimation 


It is well known that the non-reversible MLE for the 
transition probability from state i to state j is simply 
given by the ratio of observed counts from i to j divided 

by the total number of outgoing transitions from state 

Ml 


^nonrev 

Pij 


c ifc 


(18) 


We use the hat in order to denote an estimator. The term 
non-reversible implies that reversibility has not been used 
as a constraint in the estimation of p nonrev . Of course 
p nonrev can coincidentally reversible and will be re¬ 
versible if the count matrix C is symmetric. For this 
reason, some early contributions in the field forced sym¬ 
metry in C by counting S forward and backward. This 
practice is strongly discouraged as it will create a large 
bias unless the trajectories used are very long compared 
to the slowest timescales of the molecule. 
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2. Reversible estimation 


Now we consider the problem of finding the reversible 
MLE P rev by enforcing detailed balance (|4j) with respect 
to an unknown equilibrium distribution (7Tj) as constraint 
in the estimation procedure. Note that the count matrix 
used for this approach is not modified, i.e. it comes from 
a forward-only or nonreversible counting and is generally 
not symmetric. The constraints @ can be more conve¬ 
niently handled by defining the new set of variables 

Xij — TtiPij ■ (lb) 

Note that 

Xi =^2xij = 7Ti. (20) 

0 


We can thus recover the transition matrix from X = ( Xij ) 
by: 


Pij = 


Xi 


( 21 ) 


Inserting (21) into (16) and adding constraints for de¬ 


tailed balance and stochasticity leads to the reversible 
maximum likelihood problem: 


maximize 


subject to 




Cij log 


Sfc x ik 


Xij — Xj i 

> o 


( 22 ) 


Xij > 0 

Ignoring the inequality constraints the optimality condi¬ 
tions are 


+ Cji 


_ £i _ Si. _ o 


(23) 


. There is no closed 


•^IJ ' ij l 

with d = J2j c ij an d x is = )T 
form solution when including the detailed balance con¬ 
straint so that (22) has to be solved numerically. One 
option is to directly solve (23) for Xjj and turn it into a 


fixed-point iteration, as first proposed in 37 : 


„(k+i) 


Cji + Cji 


(24) 


(k) 


(k) 


where k counts the iteration number in the algorithm. 
For a starting iterate fulfilling the constraints in (221, 

" (0) = (Cij+Cji)/Ei,j 


for example x\ 


(+ Cji ), the iterates 


will be symmetric and fulfill the inequality constraints 
for all k > 0. 


If we sum over j on both sides of (24) and use (20), we 
can instead reduce the problem to iterative estimation of 
the equilibrium distribution: 


Jk+l) 


n 


E 


Cij + Cji 



The iteration is terminated when 11 7r ( fc + 1 ) — 7r( fc )|| < e . 
The final estimate if is then inserted into (231 to recover 
the reversible transition matrix estimate: 


(Cjj + Cjj ) 7Tj 
Ci 7T j | CjTTi 


(26) 


Note that both the optimum sought by Eq. ( 25|26 ) ex¬ 
hibits pij = 0 if Cij + Cji = 0. Thus, in both optimization 
algorithms, the sparsity structure of the matrix C + C T 
can be used in order to restrict all iterations to the ele¬ 
ments that will result in a nonzero element Pij > 0. 

Furthermore, note that ( 25|26 l are special cases of the 
transition-based reweighing analysis (TRAM) method - 
see RefP, Eqs (29-30) - for the special case of a single 
thermodynamic state. An example for the progress of 
the self-consistent iteration using an alanine dipeptide 
simulation is shown in Fig. [3j 

A different method of iterative solution presented irP 
updates x t j with the exact solution to the quadratic prob¬ 
lem arising from (23) while holding all other variables 
xu fixed. As shown in Fig. [3j this approach can ex¬ 
hibit faster convergence properties than the fixed-point 
iteration (25). 

Uniqueness of the estimator: The optimization 
problem (22) can be equivalently transformed into a con¬ 
vex optimization problem by replacing the decision vari¬ 
ables x^ with Zij = \og(xij) (se^l for details), which 
implies the uniqueness of the maximum likelihood esti¬ 
mator. 


3. Reversible estimation for given stationary vector 

While unbiased MD simulations are useful to estimate 
state-to-state transition probabilities Pij , enhanced sam¬ 
pling algorithms such as umbrella sampling and replica- 
exchange MD can be much more efficient in order to gain 
insight of the equilibrium distribution n. Ref.® demon¬ 
strates how an uncertain estimate of n can be combined 
with unbiased “downhill” trajectories in order to esti¬ 
mate rare event kinetics. A key in such a procedure is a 
way to estimate a reversible Markov model that is most 
likely given transition counts observed from MD simula¬ 
tions, but at the same time has a fixed equilibrium dis¬ 
tribution 7 r. Here we derive a new, efficient estimation 
algorithm for this task. 

Enforcing reversibility with respect to a given station¬ 
ary vector results in the following constrained optimiza¬ 
tion problem 


maximize 

p 

E r ' l lo S Pij 

i,j 

(27) 

subject to 

^^Pij = 1 

i 

(28) 


J 

Pij > o 

(29) 


TtiPij = 7T jPji • 

(30) 


(25) 
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7r can only be the unique stationary distribution of P if 
P is irreducible. To ensure irreducibility, we restrict the 
state space to the largest (weakly) connected set of the 
undirected graph that is defined by the adja cency matrix 
C + C T . For a system with n states, Eqs d27|28|29|30| ) 
is a convex minimization problem in 0(ri z ) unknowns 
with 0(n 2 ) equality and inequality constraints. Solv¬ 
ing this with a standard interior-point method requires 
the solution of a linear system with 0(n 2 ) unknowns to 
compute the search direction at each step. The resulting 
computational effort of 0(n 6 ) operations for solving the 
linear system quickly becomes unfeasible for increasing 
n. Therefore we will propose a fixed-point iteration that 
is also feasible for large values of n. 

To solve the maximization problem, we ignore the in¬ 
equality constraint (29) at first. The row-stochasticity 


constraint (281 is enforced by introducing Lagrange mul¬ 
tipliers A i and adding penalty terms A,; (j2j Pij — lj for 
all i = 1, ..., n to the objective function. The detailed 


balance constraint (30) is included into the likelihood ex¬ 


plicitly by the change of variables 


Pa = 


I Pij if i<j 

I %Pji else 


These substitutions result in the Lagrange function: 

^ = E Cii X °&p'ii + E( c b + c a) 


i<j 


i<j 


( A ' • A ./ ~ ) ~J2\Pu + J2K+ const 


(31) 


that we seek to maximize. By setting the gradient of F 
with respect to all p\ ■ to zero and subsequently reversing 
the change of variables, we find the following expression 
for the maximum likelihood estimate 


Pij = 


( c ij + Cji) 


A / TVj -(- Xj 7Tj 


(32) 


Note the similarity of this equation with the maximum 
likelihood result (26) where tt has been self-consistently 


computed from the counts. The row counts Ci are here 
replaced by the yet unknown Lagrange multipliers Aj. In 
order to find the Lagrange multipliers, we sum Eq. (32) 
over j: 


(°ij + c ji) n j 

A j7Tj + XjTTi 


= 1 


(33) 


This doesn’t give a closed-form expression for A How¬ 
ever, based on this equation, we propose the following 
fixed-point iteration for the Lagrange multipliers: 


A( rl + 1 ) 


E 

■ij +C : 


('Cjj + Cji) ^ 71 j 

\ (n) . x (n) 

A} n, + X\ ’iTj 


Motivated by the analogy between Lagrange multipliers 
and row counts described above, we set the starting point 
to: 


\- 0) — b E( c f? + c ^) 


(35) 


Taking the limit A^ —>• 0 + in (34) still leads to a consistent 


solution. Choosing strictly positive starting parameters 
according to (35) results in valid iterates from (34). In 


analogy to the reversible case we iterate (34), (35) until 
||A( fc+1 ) — A^|| < e. An example for the progress of the 
self-consistent iteration using alanine dipeptide simula¬ 
tion data is shown in Fig. [4| Note that the converges is 
nearly three orders of magnitude faster compared to the 
estimation with unknown equilibrium distribution (Fig. 

§• . 

Given converged Lagrange multipliers, we can exploit 
to find the maximum likelihood transition matrix 


P. For this algorithm the inequality constraints (29) are 


automatically fulfilled when > 0 for all i,j. Care must 
be taken in two situations: (i) A,; = Xj = 0 - one can 
show that the simultaneous limit A, —> 0 + and Xj —> 0 + 
can only occur for + Cji = 0, but then we know that 
Pij = 0. (ii) A 

diagonal element Cu is zero. Depending on the values 
of 7r, the solution A i may take the value of zero such that 
equation (32) for i = j becomes p lt = ca/ A.; = 0/0 which 


is meaningless and is not the correct limit of pa as ca 
goes to zero. However this can be fixed easily by using 
Pa = 1 — Ej^iPij- f° r the diagonal elements of P. In 
summary, we use the following equation for computing P 
from converged Lagrangian multipliers: 


[^j AiTTj+AjTTi * ^ ^ + Af 7^ 0 

Pij = < 0 * 7^ j, Xi + Xj = 0 (36) 

U-E jfrPij i = j 

Uniqueness of the estimator: Since the above esti¬ 
mation algorithm is iterative, it is fair to ask whether the 
estimator P it converges to is unique, or whether there 
might be multiple local maxima that we could get stuck 
in. In this case, it is easy to show that the estimator is 
unique: Let P* be an optimal transition matrix. pF = 0 
exactly if Cij + Cji = 0. Let H = {pij\cij+Cji > 0}. Then, 
the function f(P) = )>/, j logp.y is strictly convex on 
fl and the constraints restrict the solution on a convex 
subset C Q. The minimization of a strictly convex 
function over a convex set has a unique solution. 


(34) 

























Figure 4. Convergence of the reversible maximum likelihood 
estimation with fixed stationary vector. The transition ma¬ 
trix is estimated from an n = 228 state count-matrix ob¬ 
tained from alanine-dipeptide simulation data. Convergence 
is shown for different total simulation lengths. The stationary 
distribution was obtained using the simple counting estimate, 
(831. 


Figure 3. Performance of algorithms for reversible maximum 
likelihood estimation, (a) Reversible transition matrix esti¬ 
mated using the fixed-point iteration (251 from an n = 228 
state count-matrix obtained from alanine-dipeptide simula¬ 
tion data. Convergence is shown for different total simulation 
lengths T. (b) Performance comparison of the direct fixed- 
point iteration (251 and the quadratic optimizer described 
in Ref!' for reversible transition matrix estimation given the 
count matrix C = ((5, 2, 0), (1, 1, 1), (2, 5, 20)) T . Shown is 
the difference of the current likelihood to the optimal likeli¬ 
hood. (c) Same as b, but using the 1734x1734 count matrix 
from Pin WW folding simulations used in Ref. 54 . 
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IV. BAYESIAN ESTIMATION 


value (38) and the posterior variance (39) by 


We introduce new algorithms for sampling the full pos¬ 
terior probability distribution of Markov models, and in 
particular for estimating uncertainties of quantities of in¬ 
terest, such as relaxation timescales or mean first passage 
times. A key in these algorithms is the choice of a suit¬ 
able prior which enforces the sampled matrices to have 
the same sparsity pattern as the transition count ma¬ 
trix, as this allows the credible intervals to lie around the 
true value even for large transition matrices. The rele¬ 
vance of the prior is first demonstrated for nonreversible 
Markov models, for which an efficient sampling algorithm 
is known. We then introduce new Gibbs sampling algo¬ 
rithms for reversible Markov models with and without 
constraints on the equilibrium distribution that vastly 
outperform previous algorithms for sampling reversible 
Markov models. 


A. Bayes theorem and Monte Carlo sampling 

Bayes’ formula relates the likelihood of an observed ef¬ 
fective count matrix C given a probability model P to the 
posterior probability of the model given the observation, 

P(P|C) oc P(P) P(CjP) . (37) 

posterior prior likelihood 


The posterior accounts for the uncertainty coming from 
a finite observation. It incorporates a-priori knowledge 
about the quantity of interest using the prior probability 
P(P). We will see that a suitable choice of the prior 
is essential for the success of a Bayesian description for 
high-dimensional systems. 

In general, if we are interested in an observable that 
is a function of a transition matrix, /(P), we would like 
to compute its posterior moments, such as the mean and 
the variance, 


(/)=|dPP(P|C)/(P), (38) 

Var(/) = J dPP(P\C) (/(P) - (/)) 2 . (39) 

While we usually use the maximum likelihood transition 
matrix P to provide “best” estimates, /(P), the above 
integrals are of interest because a(f) = yWai•(/) gives 
us an estimate of the statistical uncertainty of /. Alter¬ 
natively, we might be interested in the credible intervals 
which encompass the true value of / with some proba¬ 
bility, such as 0.683 (la intervals) or 0.95 (2 a intervals). 
As the integrals ( 38|39 ) are high-dimensional, we need to 
use Monte Carlo methods to approximate them. 

In Monte Carlo methods we generate a sample of tran¬ 
sition matrices {p( fc )}^ i distributed according to the 
posterior and evaluate / at each element P^ in the en¬ 
semble. We then approximate the posterior expectation 


m(f] = ^J2f(P {k) ). (40) 

V k= 1 

1 N 2 

^ 2 [/] = ] v3tE( /( ^ )) —[/]) ( 41 ) 

k -1 

Obtaining good and reliable samples of the posterior 
P(P|C) is very difficult. Previous approaches have suf¬ 
fered from some or all of the following difficulties, that 
are addressed here: 


1. Choice of the prior: Given n Markov states (typ¬ 
ically 100s to 1000s), transition matrices have on 
the order of n 2 elements and are thus extremely 
high dimensional. Most priors used in the past al¬ 
low to populate all these elements , including 
those for which no transition has been observed. 
Although the effect of the prior can be overcome by 
enough simulation data, for any practical amount 
of simulation data, such priors will lead to posterior 
distributions whose probability mass is far away 
from the P true model. This problem has been first 
addressed in Ref 20 by designing a prior that equates 
mean and MLE for nonreversible transition matri¬ 
ces, leading to credible intervals that nicely envelop 
the the true value. Here we design corresponding 
priors for reversible Markov models. 


2. Uncorrelated transition counts C: As dis¬ 
cussed in Sec. (III BI, the likelihood, and thus the 
posterior depends on how transition counts are har¬ 
vested from the discrete trajectories which are gen¬ 
erally time-correlated and not exactly Markovian at 
any particular lag time r. While the MLE is often 
not or little affected by the exact way of counting 
(7, the uncertainties will be dramatically different 
if C is e.g. counted in a sliding window mode (us¬ 
ing transitions starting at all times t = 0, 1, 2, ...), 
or by subsampling the trajectory (using transitions 
starting at all times t = 0, t , 2r). Whereas the first 
approach underestimates the uncertainties, the sec¬ 
ond approach often overestimates them and is often 
not practical for large lagtimes r. Here we suggest 
to use the effective number of uncorrelated tran¬ 
sition counts, C = C eff , and a first approach to 
compute them is described in RefP3. 


3. Efficiency of the sampler: Finally, given a 
choice of prior and C, a sampling algorithm needs 
to explore the high-dimensional space of transition 
matrices in a reasonable time. This is especially 
problematic for reversible Markov models. The 
first Monte Carlo algorithm for sampling the re¬ 
versible posterior, described in RefP“, suffers from 
poor mixing due to small acceptance probabilities 
of the individual steps. In Ref.42, an improved sam¬ 
pler was proposed. Here, we propose sampling algo- 
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rithms for reversible Markov models with and with¬ 
out fixed equilibrium distribution whose efficiencies 
go far beyond previous approaches. 


B. Non-reversible sampling 


Let us first illustrate the effect of prior choice on 
Bayesian estimation of nonreversible Markov models. A 
convenient functional form for the prior is the Dirichlet 
prior 

w«niK ( 42 ) 

i j 

where B = ( bij ) is a matrix of prior-counts. For this 
choice, the posterior is given by 


p(p\c) « n 

i 



(43) 


is the matrix of posterior pseudo-counts. 
In the non-reversible case we can generate independent 
samples from (43) by drawing rows of P ^ independently 
from Dirichlet distributions J{ p 1 with Dirichlet pa¬ 


rameters OLij = Zij + 1 = Cij 


Lj flj 

- bi 




Choosing a uniform prior, bij = 0, assigns equal prior 
probability to all entries, ptj, in the posterior ensemble. 
But this a-priori assumption can lead to serious problems 
when estimating quantities for meta-stable systems. 

Consider for example the following transition matrix 
for a birth-death chain consisting of two meta-stable sets 
A = {1,..., to}, B = {to + 2,..., n}, separated by a 
kinetic bottleneck in form of a single transition state, 



1 

2 


1 

2 


0 


0 

1 

2 


\ 


P = 


i - icr b o 
1 
2 


icr h 

o 

icr b 


1 

2 


0 


1 - 10~ b 


value and its statistical uncertainty from the given sim¬ 
ulation data. Sampling the nonreversible posterior given 
expected counts for a single chain of length L = 10 6 with 
a uniform prior, bij = 0, results in non-zero transition 
probabilities for elements pij which are zero in the true 
transition matrix. As a result artificial kinetic pathways 
circumventing the bottleneck are appearing in the poste¬ 
rior ensemble which lead to a dramatic underestimate of 
the mean first passage time. The Bayesian estimate with 
90% credible interval obtained from 10 3 posterior sam¬ 
ples is [1.9, 2.0] • 10 3 , and thus two orders of magnitude 
smaller than the true value 2 • 10 5 . 

Using the prior b,j = — 1 suggested in RefP^ results 
in 90% credible intervals, [1.5,2.7] • 10 5 , which clearly 
cover the true value 2 • 10 5 . The choice bij = — 1 leads 
to a posterior distribution in which sampled transition 
matrices P have the same sparsity structure as the count 
matrix C, i.e. Pi 3 = 0 if c.; 7 = 0. As count matrices in 
the present context are generally sparse, we call this prior 
briefly sparse prior. Apparently the sparse prior leads to 
consistent credible intervals covering the true value. 


method 

estimate 

true 

2.0 f l ■ 10 5 

uniform prior bij = 0 

1.95?;!] • 10 3 

sparse prior b, 3 = — 1 

2.0?;} ■ 10 5 


Fig. m shows the convergence of the 90% credible in¬ 
terval for the sparse and the uniform prior. The credible 
interval for the sparse prior envelopes the true value al¬ 
ready given little data. To achieve consistency using the 
uniform prior requires simulations order of magnitudes 
longer than the timescale of the slowest process, thus 
rendering inference under this prior unpractical. 

Note that our prior induces a fixed sparsity structure. 
This concept should not be confused with other sparsity 
inducing priors used i.e. in the context of Bayesian com¬ 
pressed sensing 63 } where the sparsity pattern is subject 
to uncertainty. 


C. A prior for reversible Markov models 


V 


1 1 / 

2 2 / 

(44) 


For barrier parameter b = 3 and sets with to = 50 and 
n = 101 the expected time for hitting B from state x = 1 
is 2 • 10 5 steps. Now we are interested in the Bayesian 
estimator for a simulation of length L = 10'. The true 
distribution can be estimated with arbitrary precision by 
repeating the simulation many times. Here, 10 3 repe¬ 
titions led to an estimate of the 90% percentile for the 
mean first passage time of [1.5, 2.7] • 10 5 (see Table be¬ 
low). 

In practice, we cannot afford to repeat the simula¬ 
tion many times but would like to approximate the true 


Now we will present a new method for the sampling 
of reversible transition matrices. In our new approach 
we replace the Dirichlet prior (42) by a new prior for 
reversible sampling. 

Similarly as in reversible maximum likelihood estima¬ 
tion, we define our reversible transition matrix sampler 
in the space of unconditional transition probabilities x%j . 
For convenience we restrict ourselves to the independent 
set of variables with i < j (remember that Xij = Xji for 
reversible matrices), and keep them normalized to 1: 


Xij OC TCiPij 

x ij = i 


(45) 

(46) 


i>j 
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Figure 5. Convergence of the 90% credible interval for the 
sparse prior bij = — 1 and the uniform prior bij = 0. The 
dashed line indicates the true value. The credible interval for 
the improper prior covers the true value orders of magnitude 
before the credible interval for the uniform prior. 


Although X is defined slightly different as in the 
maximum-likelihood case, the mapping from X back to 
P is still given by Eq. ©. We define a prior for re¬ 
versible sampling on the set of X matrices rather than 
on P: Choosing Xjj as the set of independent variables 
has the advantage that obeying detailed balance amounts 
to sampling symmetric matrices, X = X T . 

? (X)cxII4 J - ( 47 ) 

i>j 


Similarly as ii i 39 l 42 l we will construct our Markov chain 
using a Gibbs sampling procedure, where we sample a 
single element of X in each step while leaving the other 
elements unchanged. We repeat this sampling procedure 
for every element of A', thus completing a Gibbs sweep. 
As detailed in the appendix, we can use the following 
general Gibbs step to sample the posterior (481: 


1. Select an arbitrary element Xki ■ Propose a new (un¬ 
sealed) matrix A —> X' by sampling this element 
from the proposal density q(x' kl \X): 


x' ■ 


44; 14) (bi) = (M) 

else 


(49) 


here q{x' kl \X) is an arbitrary, scale-invariant den¬ 
sity. Scale-invariance means that q{x' kl \X) oc 
q(cx' kl \cX) for any positive constant c. 

2. Accept X' as a new step in our Markov chain with 
probability min{l,p acc } where 


Pace — 


(1 - Xki + x' kl ) 


fctil-fao q(xtd | A') 7(4; 4) 
44; 4) r y(x k i\X') ’ 

(50) 


where bg = Y^k>i 4; and 7 is the marginal density: 


7(4; 4) 


, (ait.?*** 4 *** u _ 1 

{xk—Xkk+x' kk ) c k ’ 

(Zki) Ch ' +C,k+bk ‘ V. + , 

(xk-Xkl+x' k i) c k(xi-Xkl+x' kl ) c l > ' 


The posterior for reversible sampling is then given by 

r(.Yic)«nAn(r4\P (48 > 

i>j i,j \X,k x ^J 


Below we will first consider how to sample from (481 us¬ 


ing general prior counts bij. Then we will consider the 
specific choice bij = — 1 for all i < j and show that this 
choice has similar properties as the sparse prior in the 
nonreversible case. 


D. Sampling reversible transition matrices 


There is no known method to generate independent 
samples from the posterior under the reversibility re¬ 
quirement. Instead we will use a a Markov chain Monte 
Carlo (MCMC) method to generate samples from the 
posterior ensuring that each sampled transition matrix 
fulfills the detailed balance condition Q. Our Markov 
chain will generate the ensemble {A^jfcLi via a se t of 
updates advancing the chain from X^ —> Xi fc+1 ) start¬ 
ing from a valid initial state A'(°). We can do a simple 
row-normalization of the X matrices to obtain the de¬ 
sired ensemble Expectation values and vari¬ 

ances will again be estimated using (40|41 ). 


3. Renormalize the matrix X' —» X' such that it ful¬ 
fills (|46]): 


x'.. = 
x l3 


1 - x k i + 4 


(51) 


kl 


While this approach will work for any choice of prior 
counts, we will now use the sparse prior bij = — 1 for all 
i,j with the hope to achieve similarly good results as in 
the nonreversible case. For this choice, ^(x'^X) is scale- 
invariant, i.e. 7(4;|A”) = 'y{cx' kl \cX) 1 and the Jacobian 
pre-factor in (|50| is one. Thus we have: 


7(4;W q{xki\X') 
l{x k i\X’) q( 4; I A) 


(52) 


Thus the ideal choice of the proposal density is q = 7, 
which would guarantee that the acceptance probability 
is always 1. This proposal density degenerates to a point 
probability at zero if c k i + Ci k = 0, which implies bij = 
— 1 encodes a priori belief that any transition for which 
neither the forward direction nor the backward direction 
has ever been observed in the data has zero probability 
in the posterior ensemble. Thus, this prior enforces P 
to have the same sparsity structure as the count matrix, 
like the choice 4 = — 1 for nonreversible sampling. Note 
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that the reversible MLE has the same sparsity structure 
as can be seen from the update rule (24). 

We will choose proposal densities 'y(x' kl \X) that are 
also scale-invariant. In this case the normalization step 
3 above (X' — > X') can be omitted, i.e. if we accept 
X ', we can directly set it as our new sample X^ and 
obtain _p( fc ) by row normalization. We will now outline 
how to design the proposal density 7 such that the ac¬ 
ceptance probability is 1 or nearly 1. For k = l, sampling 
x 'kk ~ l( x 'kk\X) is equivalent to sampling the following 
transformed variable (see Appendix): 


where Af(x\m, s) denotes the Normal distribution of x 
with mean m and standard deviation s. The proposal 
density defined by the above update is 

q( x 'ki\ x ) = -^-Af(log x' kl - \ogx u \0, 1), (64) 

•Ekl 

and the corresponding acceptance probability is: 


= 7(4; I - x ) x 'ki 
l{ x k i\X) x kl 


(65) 


s' =- — — — ~ Beta(c fc fe, c k - c kk ) (53) 

x k - x kk + X kk 

So we can simply define q(x' kk \X) = "f(x' kk \X) and gen¬ 
erate x' kk by 

s ~ Beta(c/cfc, c k c kk ) 

x 'kk = ( x k - x kk)-, - 7 (54) 

For k ^ l, there is no known way to draw independent 
samples, but ~({x' kl \X) can be well approximated by a 
Gamma distribution by matching the maximum point 
and the second derivative at the maximum. A Gamma 
distribution can be efficiently sampled and we use it as 
proposal density and accept the resulting x' kl with prob¬ 
ability min{l,p acc }. Specifically, our proposal step is: 

x'kl ~ l( x 'ki\X) = Gamma(4;K /3) (55) 

with the parameters 

a = —hv (56) 

0 = -hv 2 (57) 


using 


—b+\/b 2 — 4ac 

v = --- 

2 a 

^ _ _Cfe_|_ Ci _ Cfcj + c ik 

[v+x k -x k i) 2 (v + xi - x k i) 2 v 2 
a = c k + ci — c k i — cik 

b = (Cfe — C k i — Ci k )(xi - X k i) + (c ; — Ckl - Ci k ){x k 
c = — (CfcZ + Ci k )(x k — x k i)(xi — Xkl ) 


(58) 

(59) 

(60) 

- x kl ) 
(61) 

(62) 


which matches the value and the first two derivatives 
of the true marginal density at the maximum (see Ap¬ 
pendix for derivation), and leads to acceptance probabil¬ 
ities close to one for most values of x k i. However, if the 
current value of x k i is in one of the heavy tails of the 
distribution 7 ( 2 ^|X), the acceptance probability can be 
much less than 1 and the Markov chain can get stuck. 
In order to avoid this problem, we utilize a second step 
to generate x' kl : After we sample x' k[ from the proposal 
density (55) we sample x' kl according to: 


log X' k i ~ N (log x 'kl I Xkl, 1) 


(63) 


In summary, the proposed Algorithm [l] is a Metropolis 
within Gibbs MCMC algorithm with adapted proposal 
probabilities for each Gibbs sampling step. For efficiency 
reasons, transition matrix elements (*, j) for which no 
forward- or backward transition counts have been ob¬ 
served, can be neglected in the sampling algorithm, in 
order to account for the effect of the sparse prior. 


Algorithm 1: Reversible sampling algorithm 

Input: C, XX 
Output: A'^ +1 ^ 

for all indexes ( k , l) with k < l and c k i + cik > 0 do 

if k = l then 

Sample x k ^ from (541 

end 

else 


Calculate a and /3 by ( 56|57 1 , and sample x' M 
from Gamma(«, /3). 

— li x 'kl PO Gamma(i/.| \cx.,j3) 

Pace — i( Xkl \x) GammaCa^jjc*,/?) 

Accept x' k i as x k i +1 ' ) with probability min{l,p acc } 
Sample x' kl by \ogx’ kl ~ A7(log x k i , 1). 


P. 


y( x ki A) x 'ki 


l(. x kl\ x ) x kl 

Accept x' k i as x^ 1 ^ with probability min{l,p acc } 


end 

end 


E. A prior for reversible Markov models with fixed 
equilibrium distribution 


As before we will be working with variables Xij = nipij 
related to transition matrix entries Pij via (45). In con¬ 
trast to the previous algorithm, 7 r is not a function of P 
but fixed. Thus the single normalization condition (46) 
is replaced by a condition for each row: 


x ij = 7Tj, (66) 

3 


in order to ensure reversibility with respect to the given 
7 r. 

All x k i in the lower triangle (fc > l ) are used as inde¬ 
pendent variables. Given a valid X matrix, an update 






















that respects the constraints is given by 


as 


13 


Xkl —> x'kl 

(67a) 


f° 

Pkk > 0} Ckk — 0 

Xkk ^ X k k T {xkl X k l) 

(67b) 

bkk = s 

1 ^ 1 + e 

Pkk — O 5 — 0 

Xlk —> x'kl 

(67c) 

1 

l-l 

Ckk ^ 0 

Xu Xu + (x'lk - Xlk). 

(67d) 





(67b I and (67d) ensure that the normalization condition 
( 66 ) holds for the new sample and will thus keep n con¬ 
stant, while (67c) restores the symmetry and thus ensures 
reversibility of P. We will again use the prior (47) defined 
on the set of X matrices and sample from the posterior 
(48). The ideal proposal density of x' kl is 


This choice of prior will again ensure that Cki + ci k = 0 
results in pki = 0 and pik = 0 for all k < l and for all 
posterior samples, a property shared by the reversible 
MLE with fixed stationary vector. This ensures that the 
posterior mass is located around the maximum likelihood 
estimate P and again prevents the occurrence of artificial 
kinetic pathways in the posterior ensemble. 


704 I X) oc(x' kl ) Ckl+Clk+bkl (x kk + x kl - x’ u ) Ckk+bkk 

(Xu + X ki - x' kl ) Cu+bl1 (67e) 

which is the conditional distribution density for given all 
off-diagonal elements of X (except x k i), 7 r and the counts 

C. 

We have seen that a correct choice of prior parameters 
was essential in order to successfully apply the posterior 
sampling for meta-stable systems. As in the reversible 
case we will use b k i = —1 for k > l to enforce Xki = 0 
whenever c k i + Q k = 0 . 

However, the choice of prior counts for the diagonal 
elements b kk is less straightforward. According to our 
experience, a good choice is to determine the value of b kk 
based on the maximum likelihood estimate p kk of the 
fc-th diagonal element as 


bkk 


0 Pkk P 0? Ckk 11 

— 1 else 


( 68 ) 


which ensures that the posterior expectation of p kk is zero 
if and only if p kk = 0 , and the conditional expectation of 

( 1673 ), 


F. Sampling reversible Markov models with fixed 
equilibrium distribution 


We now investigate how to efficiently sample the con¬ 
ditional (67e). Here we assume without loss of generality 
that x kk < Xu and transform x' kl £ ( 0 , x kk + x k {) into a 
new variable v r £ ( 0 , +oo) via 


v kl 


Xkk T X k l X k [ 

The ideal proposal density of v’ is then 


(71) 


lv{v'\X) oc 


dx‘ 


kl 


dv' 


l{x'kl\ X ) 


( /\Ckl-\-Clk+bkl 

= (v ) 


s — 1 


cii+bn 


. (1 + w / ) -(c W +c«*+c fcfc +c I1 +6 w+ 6 M +6 M +2) 

= ( v') ai f + v' J (1 + t; , )“ (ai+ ° 2+ T^2) 


E(x' kl | X) = Ck * + Cl * (xu + x kl ) (69) 
Ckl + Clk + Cn 


matches the MLE of the one-dimensional likelihood func¬ 
tion for x k i given X if p kk > 0 and c kk = 0. (Note that 
for the MLE of P, there is at most one k which satisfies 
p k k > 0 and Ckk = 0 - see proof in Appendix.) 

However, in the case that p kk = 0, the conditional 
(67e) would then degenerate so that x' kk = 0 with prob¬ 
ability one, and the k -th row and column of X would 
remain fixed in the sampling process. This effect can 
break ergodicity in the sampled Markov chain and there¬ 
fore prevent convergence of the algorithm. This problem 
is avoided by regularizing the prior choosing th e prio r pa¬ 
rameter as b k k = — 1 + e for p kk = 0 such that (67e) does 


not degenerate, where e > 0 is a small number. In addi¬ 
tion we need to ensure that the Markov chain is started 
from an initial state A'l°) with x kk > 0. In summary, we 
select the prior of X for reversible sampling with fixed tt 


with 

Xll + Xkl 

s = - 

X k k T Xkl 

ai = Cki + cik + b k i 

^2 — C k k T b k k 

0-3 = Cu + bll 

Like in the previous algorithm, we can approximate the 
conditional of v by a Gamma distribution as: 


7 „(u , |A) « Gamma(u , |a, /3) (73) 


with 


a = —hv (74) 

/3 = -hv 2 (75) 
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and 


V = 

-b+ VW 

— 4ac 


(76) 

2 a 



h — 

ai 

03 

o 2 

(77) 


52 G 

\ 2 

±1+*) 

(1 + u) 2 

a = 

02+1 



(78) 

b = 

02 Or T 

02 + 03 + 1 


(79) 

s — 1 


c = 

s (ai + 1) 
1 — 5 



(80) 


See Appendix for derivation. Then v' can be sampled 
by a Metropolis sampling step with the proposal density 
Gamma(i/|a, 0) and the acceptance ratio 


min{l, p acc } = min 


1 7^(ri / |A)Gamma(t!|a, /3) 

’ 7 „(?;|X)Gamma(?' / |a, /3) 


where v = Xu/xkk denotes the original value of v. In 
addition, in order to avoid the sampler from getting stuck 
at an extremely small or large value of v, we also utilize 
the same strategy as in Section IVD|to generate v' by 


log?/ ~ A/(log?/| logu, 1) 


The proposed Algorithm [2] for sampling of reversible 
transition matrices with fixed stationary vector can again 
be characterized as a Metropolis within Gibbs MCMC 
algorithm with adapted proposal probabilities. 


Algorithm 2: Reversible sampling algorithm with 
fixed stationary vector 


Input: C, 7r, X^ 

Output: X l '-' i+l> 

for all indexes (k , l ) with k < l and Cki + cik > 0 do 

if Xkk < xu then 

y — X kl 

x kk 

s — x ll+ x kl 
x kk+ x kl 

«i = Cki + cik + bki 
an = Ckk + bkk 
03 = cu + bu 

end 

else 

y —— X kl 

X ll 

„ — %kk+Xkl 
xil+Xkl 

«1 = Cki + Cik + bkl 
a 2 = Cu + bu 
03 = Ckk + bkk 


end 


Calculate a and /3 by ( |74| ) and ( |75| ) 
Sample v' from Gamma(a, /3). 

Let x' k i = min {x k k + x k i,x u + x k i} ■ yyq;. 


Pace — 


'y-v ( v' |X)Gamma(i>|a:,/3) 
'y-v (v |X)Gamma(i) / 1 ck,/3) 


using l67al-(67dl 


Accept xu as with probability min{l,p acc 

Sample v' by by logu' ~ A/’Qogv, 1). 


} 


_ n lv (y'\X)v' 

Pacc ~ 7 V (*kl\X)v 

Let x'ki = min{a:fcfc + Xki,xu + x k i} ■ yq)y ■ 

Accept x'ki as xH +1) with probability min{l,p acc } 

end 
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V. RESULTS 
A. Validation 


We first demonstrate the validity of the reversible sam¬ 
pling algorithm for the following 2x2 count-matrix, 


C = 




(81) 



Matrix element, pi 2 Matrix element, pi 2 


In Fig. [fl] we compare the sampled histograms using Al¬ 
gorithm |T] with analytical values for the non-reversible 
posterior with Dirichlet-prior-counts bij = — 1. Any 2x2 
transition matrix automatically fulfills detailed balance, 
and therefore the analytical and sampled densities are 
expected to be equal. The histogram for samples of the 
reversible algorithm are indeed in agreement with the 
analytical posterior. 

In Fig. [7] the sampled histogram for count matrix (81) 
with fixed stationary distribution n = (0.25, 0.75) T us¬ 
ing Algorithm [2] is compared with the exact posterior 
distribution. The detailed balance relation Q with fixed 
stationary vector enforces a linear dependency between 
the transition matrix element P 12 and p 2 i- The resulting 
posterior is therefore restricted to the line nipi 2 = 7r 2 P 21 
such that the projection onpi 2 in Fig. [TJalready contains 
the full information about the one-dimensional posterior. 
A comparison between histogram frequency and analyt¬ 
ical density demonstrates the validity of the algorithm. 


Figure 6. Sampled histogram frequency (a) and analytical 
probability density (b) of reversible posterior for 2x2 count 
matrix. Sampled frequencies are in agreement with the ana¬ 
lytical probabilities. 



Matrix element, P 12 

Figure 7. Sampled histograms and analytical probability den¬ 
sity for reversible posterior with fixed stationary vector for 
2x2 count-matrix. 
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B. Applications 

To demonstrate the usefulness of the proposed algo¬ 
rithms we apply them to molecular dynamics simulation 
data. Here, two systems are chosen to illustrate our 
methods: (1) The alanine-dipeptide molecule, and (2) 
the bovine pancreatic trypsin inhibitor molecule (BPTI). 

We start by discussing the alanine dipeptide results. 
The system was simulated on GPU-hardware using the 
OpenMM simulation package®! using the amber99sb-ildn 
forcefielcpS and the tipSp water modePR The cubic 
box of length 2.7 nm contained a total of 652 solvent 
molecules. We used Langevin equations at T = 300A' 
with a time-step of 2/s. A total of 10/is of simulation 
data was generated. The <p and ip dihedral angles were 
discretized using a 20 x 20 regular grid to obtain a ma¬ 
trix of transition counts C = (c^), here by sampling one 
count per lag time r. Below we will show histograms for 
two important observables, largest implied time-scales ti 
and expected hitting times , t(A —> B ), for pairs A 1 B of 
meta-stable sets. We compute the posterior sample-mean 
and 90% credible intervals for 1 pis of simulation data and 
show that the credible intervals nicely envelop a reference 
value obtained from the MLE transition matrix for the 
total simulation data, supporting the proposed prior as a 
’good’ choice for reversible sampling in meta-stable sys¬ 
tems. 


1. Alanine dipeptide, reversible sampling 


In Fig. [ 8 ] we show histograms of implied time-scales 
computed from a reversible posterior sample. The mean 
values estimated from the posterior sample are in good 
agreement with the reference values. Tables El and |ITT| 
compare the reference values with the sample mean /i 
and sample standard deviation a for each observable. 

In order to gain a first impression of the efficiency of the 
sampling and the quality of our estimates, we compute 
the integrated autocorrelation time t corr for each quan¬ 
tity sampled (here implied timescales and hitting times). 
The error of the sample mean m[f] compared to the true 
mean (/) can then be estimated as 

e = E[(m[/]-(/>) 2 ] = ^ (82) 


where N e g = IV/ (1 + 2 t corr ) is the effective number of 
samples with N the total number of samples. See Ref.^ 
for a thorough discussion. t corr and e are also reported 
in Table (|H|). 

Fig-il -f) show histograms for expected hitting times 
for the three transitions C 5 —> C% x , C 5 —> and 

C 5 —l► an between meta-stable sets in the <p and ip dihe¬ 
dral angle plane. Again, mean values are in good agree¬ 
ment with the corresponding reference values. Table (III) 
summarizes the computed results. The table columns 
again contain the reference value f the mean value g, 



ti/ps 

g/ps 

a/ps 

t/ps 

tcorr 

t2 

1462 

1556 

303 

19.00 

197 

tz 

71 

73 

1 

0.01 

10 

U 

36 

43 

5 

0.06 

7 


Table II. Comparison of reference implied time-scales (ti) with 
mean g and standard deviation a from the reversible posterior 
using N = 10 5 samples, e is the estimated error of the mean g 
and tcorr is the autocorrelation time of the sampled quantity. 



t/ ns 

g/ns 

a/ns 

e/ns 

tcorr 

t(C 5 CT) 

60.4 

56.7 

12.0 

0.77 

202 

t(C 5 -»• a L ) 

43.6 

40.6 

8.3 

0.53 

206 

t(C 5 -¥ ax) 

0.253 

0.250 

0.005 

0.0004 

218 


Table III. Expected hitting times computed from reversible 
posterior using N = 10 5 samples. See Table |il| for definition 
of other symbols. 


the standard deviation < 7 , the estimated correlation time 
tcorr and the error of the mean value, e. 


2. Alanine dipeptide, reversible sampling with fixed 
equilibrium distribution 

Below we report results for sampling with fixed sta¬ 
tionary distribution. The stationary distribution iii was, 
for sake of simplicity, computed using the relative fre¬ 
quencies of state occurrences, 


°ik 

^2j,k C 3 k 


(83) 


It should be noted that a more useful and independent 
source of ir are enhanced sampling simulations targeted 
at rapidly generating a good estimate of the equilibrium 
probabilities alone. See Ref.^ for methods and applica¬ 
tions that combine MD simulations and enhanced sam¬ 
pling simulations in order to efficiently compute rare- 
event kinetics. 

Results are shown in In Table IV and Fig. [9] The 
sample mean is again in good agreement with the refer¬ 
ence value. For the computatio of the reference value 
we use the MLE transition matrix of the full simula¬ 
tion data reversible with respect to the input stationary 
distribution for the posterior sampling. The additional 
constraint imposed by fixing the stationary distribution 
is clearly reflected in smaller standard deviations for all 
shown observables compared to the reversible case. 

Histograms for expected hitting times between meta¬ 
stable sets are shown in Fig. §1 -f). The sample mean is 
again in good agreement with the reference value. Again, 
we summarize our results, c.f. Table [V] 
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Figure 8. a)-c) Implied time-scales, ti. Histograms obtained from reversible posterior sampling. Dashed lines indicate the 
reference value, ti, dotted lines indicate the posterior sample mean, p(ti). MLE and posterior mean are in very good agreement 
for the proposed choice of prior. The 90% credible intervals are the shaded regions in gray. Expected hitting times, t, d)-f). 
Histograms obtained from reversible posterior sampling. Dashed lines indicate the reference value, f, dotted lines indicate the 
posterior sample mean, p(r). In all cases the reference value obtained from a long simulation is clearly compatible with the 
posterior sample (credible interval). 



U/ps 

n/ps 

a/ps 

t/ps 

tcorr 

t2 

1594 

1520 

196 

0.6 

l 

G 

72 

73 

1 

0.003 

l 

U 

38 

41 

3 

0.01 

l 


Table IV. Comparison of reference implied time-scales , (ti), 
with mean p and standard deviation a from the reversible 
posterior using a fixed equilibrium distribution and N = 10 5 
samples, t is the estimated error of the mean p and t cOII is 
the autocorrelation time of the sampled quantity. 




r/ns 

p/ns 

a/ns 

tcorr 

e/ns 

t(C 5 - 

->• C? x ) 

56.0 

54.5 

5.0 

l 

0.02 

t(C 5 

—> oll) 

41.5 

39.5 

5.1 

l 

0.02 

r(C 5 

—► cur) 

0.249 

0.251 

0.003 

l 

9.7- 10“ 6 


Table V. Expected hitting times computed from reversible 
sampling with fixed stationary vector using N = 10 5 samples. 
See Table [TV| for the definition of symbols. 

3. Bovine pancreatic trypsin inhibitor, reversible sampling 

For BPTI, we used the 1 ms simulation generated on 
the Anton supercomputer^. Please refer to that pa¬ 


per for system setup and simulation details. We pre¬ 
pared data as follows: C a atom positions were oriented 
to the mean structure and saved every 10 ns, resulting in 
about 100,000 configurations with 174 dimensio ns. Ti me- 
lagged independent component analysis (TICA) 32 67 ^was 
applied to reduce this 174-dimensional space to the two 
dominant IC’s as a spectral gap was found after the 
second nontrivial eigenvalue, fc-means clustering with 
k = 100 was used to discretize this space. 

Effective count matrices were obtained using the 
method described iiP at a range of lag times up to 2 /zs. 
Fig. [lO] shows the implied relaxation timescales obtained 
from a maximum likelihood estimate with values com¬ 
parable to the Hidden Markov model analysis in RefP^. 
The figure also shows uncertainties computed from a re¬ 
versible transition matrix sampling as described above 
with N = 1000 samples. Only every 20th transition ma¬ 
trix sample was used to compute timescales in order to 
reduce the computational effort to 50 eigenvalue decom¬ 
positions. It is seen that the MLE is nicely contained in 
the 2er (95%) credible interval. The entire transition ma¬ 
trix sampling for Fig. [lOjtook about 12.5 seconds on a 1.7 
GHz Intel Core i7. Given that 8 lag times were used for 
1000 samples of 100 x 100 matrices that contained about 
40% non-zeros, about 2.56 million elements are sampled 
per second, and about 640 full transition matrix sam¬ 
ples are generate per second. Below a more systematic 
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Figure 9. a)-c) Implied time-scales, ti. Histograms obtained from reversible posterior sampling with fixed stationary vector. 
Dashed lines indicate the reference value, ti, dotted lines indicate the posterior sample mean, /r(ti). The 90% credible intervals 
are the shaded regions in gray. d)-f) Expected hitting times r. Histograms obtained from reversible posterior sampling with 
fixed stationary vector. Dashed lines indicate the reference value, f, dotted lines indicate the posterior sample mean, /r(r). 
The reference value obtained from a long simulation is clearly compatible with the posterior sample (credible interval) in all 
cases except the . 


analysis of the computational efficiency is made. 



Figure 10. Implied timescales for a Markov model of bovine 
pancreatic trypsin inhibitor (BPTI). The error bars are 95% 
confidence intervals estimated using the reversible transition 
matrix sampling algorithm described here using transition 
counts as described in Ref. 58 . 
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C. Efficiency 


We compute acceptance probabilities of the 
Metropolis-Hastings steps and compare the statisti¬ 
cal efficiency of the proposed sampling algorithm with 
the algorithm proposed in Refill that uses uniform 
proposal densities. Efficiency is measured in terms of 
achieved autocorrelation times for sampling of transition 
matrices with different sizes. As a representative observ¬ 
able we choose the largest relaxation time-scale f 2 f° r the 
alanine dipeptide molecule and compute autocorrelation 
functions and autocorrelation times from a large sample 
of size N = 10 6 . Two differently fine discretizations were 
used, resulting in n x n-shaped transition matrices with 
n = 258 and n = 1108. 

Fig-El shows autocorrelation functions for the second 
largest relaxation time-scale, £ 2 , for a reversible poste¬ 
rior ensemble. The autocorrelation function for the re¬ 
versible sampling algorithm with posterior adapted pro¬ 
posals shows a much faster decay than the autocorrela¬ 
tion function for the algorithm in RefP^. Table VI com¬ 
pares acceptance probabilities and autocorrelation times. 
The present proposal steps lead to very high acceptance 
probabilities, p > 0.99, for the sampling off-diagonal en¬ 
tries. The main advance, however comes from the fact 
that the step for sampling diagonal transition matrix ele¬ 
ments in RefP^ has suffered from a very poor acceptance 
probability. As that step was the only step that mod¬ 
ified the equilibrium distribution, the sampler in Ref. 39 
has very poor mixing properties. In contrast, our current 
algorithm generates independent samples for the diago¬ 
nal elements, resulting in an acceptance probability of 
p= 1.0. 

The autocorrelation times for the new sampler are 
more than a factor 5 shorter for the small (233 state) 
matrix and more than a factor 13 shorter for the large 
(1108 state) matrix, indicating a much higher efficiency 
of the new approach. The autocorrelation time of the 
new algorithm increases only mildly for matrices of in¬ 
creased dimension, indicating that the present algorithm 
will be useful for very large Markov models. 

Fig. [12] shows autocorrelation functions for reversible 
sampling with fixed stationary vector. The posterior 


algorithm 

n 

Poffdiag 

Pdiag 

tcorr 

old 

233 

0.216 

0.011 

1088.1 

1108 

0.271 

0.005 

3241.9 

new 

233 

0.994 

1.0 

194.7 


1108 

0.995 

1.0 

242.6 


Table VI. Acceptance probability and autocorrelation times 
for old vs. new reversible sampling algorithm, n: number of 
states; p 0 ffdia g , Pdia g : acceptance probabilities for off-diagonal 
and diagonal elements, respectively. t corr : Autocorrelation 
time for the sampling of the slowest relaxation timescale, in 
number of transition matrix sampling steps. 




Figure 11. Autocorrelation function for reversible sampling. 
Sampling of transition matrices with n = 233 states, a), and 
b) with n = 1108 states. Increasing the dimension of the state 
space has only a small effect on the new sampling algorithm. 


algorithm 

n 

Poffdiag 

tcOTT 

old 

233 

0.175 

72.096 

1108 

0.230 

> 10000 

new 

233 

0.752 

2.893 


1108 

0.706 

3.157 


a Autocorrelation function not converged 


Table VII. Acceptance probability and autocorrelation times 
for old vs. new reversible sampling algorithm with fixed sta¬ 
tionary distribution. Symbols as in Table |VT1 


adapted proposals in reversible sampling algorithm with 
fixed stationary distribution again result in a much faster 
decay of the autocorrelation function than th e un iform 
proposals of the algorithm in Ref. 39 . Table VII com¬ 
pares acceptance probabilities and autocorrelation times 
for the two algorithms. For sampling with fixed station¬ 
ary vector there is no sampling step for the diagonal el¬ 
ements. Although the average acceptance probabilities 
are only a factor of 3-4 better for our new algorithm, the 
autocorrelation times are decreased by a factor 35 for the 
small system (233 states) and over a factor 300 for the 
large system (1108 states). Again, there is only a mild 
increase in autocorrelation time when the dimension of 
the sampled space is increased. 
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Figure 12. Autocorrelation function for reversible sampling 
with fixed stationary distribution. Sampling of transition ma¬ 
trices with n = 233 states, a), and b) with n = 1108 states. 
Increasing the dimension of the state space has only a small 
effect on the new sampling algorithm. 


VI. CONCLUSION 


In this work we have described and significantly ex¬ 
tended the state of the art in reversible Markov model es¬ 
timation. Reversible Markov models are expected to nat¬ 
urally arise from molecular dynamics implementations 
that fulfill microscopic reversibility. Reversibility is an 
essential property in order to analyze the equilibrium 
kinetics of a molecule. However, in order to have re¬ 
versibility in a Markov model, it needs to be enforced 
in the estimation procedure. When done correctly, re¬ 
versible estimation does not bias the model but rather 
reduces statistical errors as a result of a smaller number 
of degrees of freedom. 

We have presented minor improvements to an existing 
self-consistent estimation algorithm for reversible Markov 
models. Then we have presented a new and efficient algo¬ 
rithm to estimate reversible maximum likelihood Markov 
models given a fixed equilibrium distribution. 

The main part of the presented work focuses on the 
long-standing problem of Bayesian estimation of the pos¬ 
terior ensemble of reversible transition matrices. Al¬ 
though several algorithms to sample reversible Markov 
models have been presented in the past, they have been 
hampered by three fundamental problems, two of which 
are addressed here: (i) Which prior should be chosen such 
that the posterior is located around the true value rather 
than completely elsewhere? (ii) How should transition 
counts be obtained when time series are correlated and 


not really Markovian at any given lag time r? (iii) How 
can the sampling algorithm be made efficient such that 
also large transition matrices can be sampled in reason¬ 
able time? 

To address problem (i), we develop priors that ensure 
that reference values and sample mean are similar. The 
key property of these priors is that they make the a priori 
assumption that transitions between states that have not 
been sampled in the trajectory in either direction, have 
zero probability. This is a sparse prior, i.e. an improper 
prior enforcing that the sampled transition matrix has 
the same sparsity structure as the maximum likelihood 
estimate and as induced by the observation. In contrast 
to most other priors that have been previously suggested, 
these sparse priors achieve the desired property of cre¬ 
ating errors bars that nicely envelop the reference esti¬ 
mates. 

For problem (ii), we have described the principles of 
how it can be addressed. We suggest that effective count 
matrices are obtained using the concept of statistical inef¬ 
ficients. A separate preprint^ suggests an initial solution 
towards this aim that is successfully applied on simula¬ 
tion data of the bovine pancreatic trypsin inhibitor in the 
present paper. The solution of problem (ii) is still in its 
infancy and needs further investigation.. 

For problem (iii), we present highly efficient Gibbs 
sampling algorithms for reversible transition matrices 
and reversible transition matrices with fixed equilibrium 
distribution. Both methods are demonstrated to have 
acceptance probabilities close to 1 in their individual up¬ 
date steps. Autocorrelation times from samples of the 
slowest relaxation timescale are one or two orders of mag¬ 
nitude shorter than with a previous Gibbs sampling algo¬ 
rithm, indicating a high statistical efficiency of our sam¬ 
pler. 

Implementations of all algorithms described here are 
available in PyEMMAEH as of version 2.0 or later - 
www.pyemma.org . 
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Appendix A: Details for transition matrix sampling 

1. Reversible transition matrix sampling: derivation of 
marginal densities 

We first pick a single element (fc, l) of X (diagonal 
or off-diagonal) and sample it from a proposal density 
x' kl ~ q( x 'kl\ X ) that is scale-invariant with q(x' kl \X) oc 
q(cx' kl \cX) for all c > 0: 


The proposal density of x' kl given X can be expressed as 


<lx( x 'kl\X) = 


(dx\ 


kl 


\dx 


kl 


1 ~ %kl 
(! - x 'kl ) 2 

1 - x k i 

(i-x' kl r\i-x’ k 


q(x' u \x) 

q(x' kl \X) 

1 %kl 


°kl 


4/1* (A 8 ) 


q( x 'ki\ x )> (*>i) = (M) 
,, else 


(Al) 


Therefore, 

Qx ( x ij | X ) = 


1 - x',> 


1 - x'u 


and then renormalize the matrix such that it retains an 
element sum of 1 : 


_ 

(1 - Xu) 

Y~ - q{xu\X') 

1 - Xu 


x -— • Xu\X' 

1 - Xu 


Xj A . 

3 1 - Xu + x\ 


(A2) 


and 


kl 


Since q(x’ kl \X ) is a probability density function, we can 
obtain from its scale-invariance that 


Qx(.Xij \ X ) 
9x(x' kl \X) 


1 - Xu 
1-x' 


kl 


q( x 'kl\ X ) 

q(x k i\X') 


(A9) 


(A10) 


/ ,, 1 f , , 1 The partitial derivatitive of x\j with respec t to Xij for 

Q( cx u\ cX )^ x u = ~ J q{ cx u\ cX )^ ( cx ki) = ~ (*,j) 7 ^ {k,l) can be calculated according to (A 2 ) as 


and 


q(cx' kl \cX) = -q{x’ kl \X) 


(A3) 


(A4) 


dx'ij 

dxij 


1 


1 - x[ 


kl 


1 - Xu + x' k , 1 - Xu 


(All) 


According to Theorem 13.1 irP^, the posterior distribu¬ 
tion P(X|C) is the invariant distribution of the proposed 
update step if we accept X' as the new sample with prob¬ 
ability min{l,p acc } and 


Combining all the above results leads to 

!&£P-&o q{xki\x') i{x'u\x) 


Pace — (1 x kl T x k j ) 


q(x' kl \x) ”i{x H \x) 


(A12) 


with 


Pace — 


P(X') P(CjA') q x { Xij \X') n dx'ij 

P(X) ' P(C|A) ' q^X) ' ff dxn 

(A5) 

where q x (x' kl \X) denotes the proposal density of x\j given 
X. Note X only contains n(n+l)/2 —1 free variables. So 
we select {xij\i > j,(i,j) 7 ^ (i', j')} as the free variable 
set of X, where avy is an arbitrary element of X with 
i' > j' and (*',/) 7 ^ (k ,l). Let us consider each term on 
the right hand side of ( |A5| ). 

From the definition of X ', we have 


(s'fcfc) 


z kk~^~^kk 


l{x'ki\ x ) oc 


{x k -X kk +x' kk ) a k , 

\°kl+ c lk+ b k l 

_ \ X kl) _ 

( x k -x k i+x' kl )° k (xi-x kt + x' kl )°‘, 


k = l 
k ± l 


(A13) 


2. Reversible transition matrix sampling: Efficient 
proposal densities 


a. Diagonals: 


Let us define variable s' = 


P(A') 

P(A') 


1 


1 - Xkl + x'k 


kl 


-) 

Xkl J 


b k i 


- Xkk , . If x'uu is sampled from the proposal den- 

sity "f(x kk \X), the corresponding proposal density of s' 
can be expressed as 


1 ~ X 'kl 

1 %kl 


bo 


*kl 

%kl 


(A 6 ) 


dx 


kk 


ds' 


l{ x 'kk\ X ) 


and 


{ x ' kk y kk {x k -x kk +x' kk ) 


- x Jl k ( x k~ x kk~\~ x kk) Cfc ’ 

mn = 


P(C|X) 


V k kl+Clk {x k -X k l+X k l)~ a k 
(xi 

' (xi-x k i+x k i)~ c ‘ ’ 


k = l 


k y l 


(A7) 


Note that 


Therefore 


x kk — ( x k x kk) 


1-s' 


dx' 


ds 


Y = (xk - x kk ) (1 - s') 


-2 


(A14) 


(A15) 


(A16) 
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and 


s' ~ (x k - x kk ) (1 - s') 2 


(i-«T 


1 - s' 


' i x 'kk) Ckk 1 • ( x k - Xkk + x' kk ) Ck 

Cfcfc —1 

• (1 - s') Ck 


The maximum point can be computed as the root of a 
quadratic equation in the usual way, 


v = 


—b + \Jb 2 — 4ac 
2 a 


(A24g) 


= (s , ) Cfefe_1 (1 - s') 


/\Cfc—Cfcfc — 1 


(A17) with parameters a, b , c given by 


The above equation implies that s' follows the Beta dis¬ 
tribution with parameters c k k and c k — c kk . So we can 
sample s' ~ Beta(c/jfc, c k — c kk ) and get x' kk by the back- 
transform. 

b. Off-diagonals: We consider how to approximate 
^(x' kl \X) with k l and b k i = — 1 by a gamma distribu¬ 
tion density function. Define 

( x ki) Ckl+C ‘ k ~ 1 (Xk - x kl + x' kl )~ Ck (x l - x u + x' kl )~ Cl 

(A18) 

= (x' kl )~ expf{x' kl ). 

(A19) 


The function f(x' kl ) is then given by 


a = c k + ci- c k i- cik (A24h) 

b = (c fc - c ki - ci k )(xi - xu) (A24i) 

+ (ci - Cu - c ik )(x k - Xu) (A24j) 

c = -(cfej + ci k )(x k - x u )(xi - x k i) (A24k) 


The second solution corresponding to (A24g I with neg¬ 
ative sign in front of the square root can be safely ex¬ 
cluded since y is required to be non-negative. 


3. Reversible sampling with fixed stationary distribution: 
efficient proposal densities 


f(x'u) = ( Cu + C ik ) log x'u - C k logOfe - Xu + x'u) 

-Cl logy - Xu +x' kl ). 

(A20) 

We approximate / using a three parameter family of 
functions 

f(x'u\a, B,fo)=a log x’ kl - /3x' kl + fo (A21) 
so that the corresponding approximate j(x' kl \X) is 

l(x'u\X) oc (x'uV 1 exp f(x' kl \a,/3, f 0 ). (A22) 


(A22) is a Gamma distribution with parameters cr, [3 
which can be efficiently sampled®. 

The three parameters a, B, fo are obtained matching 
/ and / up to second derivatives at the maximum point 


v = argma xf(x' kl ). 

This leads to the following linear system 
log v a + v/3 + fo = f(y) 

“ - P = f'(v) = 0 

V 

~ = f"(v) = h 

yZ 

with solution 


(A23) 

(A24a) 

(A24b) 

(A24c) 


and 

h = f"(v) = 


a = —hv 
B = —hv 


„-.2 


fo = f(v) + hv 2 {logv - 1) 


Ck 


Cl 


(A24d) 

(A24e) 

(A24f) 

cm + ci k 


(■ x k - xu + v ) 2 (xi - Xu + v ) 2 


We first prove that there exists a maximum likelihood 
estimate P = \fiij] of the transition matrix which satisfies 
|{fc |Pkk > 0 ,c k k = 0}| < 1. Suppose that X' = [xO] is 
a maximum likelihood estimate of X and there are two 
different indices k, l which satisfy that x' u > x' kk > 0 
and c kk = Cu = 0. We can then construct a new matrix 
X" = [x".\ with 


x'kl = x'u + x' kk 

x' kk = 0 

X k l = X k i + x kk 
x ll = x'u + x'u — x' kk . 

It can be verified that the likelihood of X" is larger than 
X', which implies that X" is also a maximum lielihood 
estimate of X. Repeat the above procedure, we can fi¬ 
nally get an maximum likelihood estimate X of X which 
has at most one k with x kk > 0 and c kk = 0, and the 
corresponding P satisfies that \{k\p kk > 0, c kk = 0}| < 1. 

We will now investigate how to approximate the den¬ 
sity 


/ \ a 3 / \ — (0'i+a2+a3+2) 

lvW\X) OC {v'r ( + VJ f 1 + f/j 

(A25) 

with s > 1 by a Gamma distribution density. 

As in the reversible case we will use the representation 

lv{v'\X) = (i/) -1 exp/(V) (A26) 


and approximate f(v') using the three parameter family, 


f{v'\a, (3, fo), given in (A211 . The resulting approximate 
density, %(v'\X), has tl 


re same nice properties as the one 


from (A22). 
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Parameters a /3 and fo are given by (A24d), (A24e) 
and (A24f). The maximum point v is given by (A24g). 


The parameters a, b , c are, 


a = a 2 + 1 

(A27a) 

, a 2 + a 3 +1 

b = a 2 ai + 

s — 1 

(A27b) 

s(ai + 1) 

1-s 

(A27c) 
















